Multivariable Linear Regression Analysis

Authors

MOHD FITTRI FAHMI BIN FAUZI (23202545)

MOHD FAUZIE BIN ISMAIL (23202542)

AHMAD SHAHRIL HAFIFI BIN SAAD (23202563)

Published

January 4, 2026

1 Part A: Introduction and data generation

Quality of Life (QoL) is a multidimensional construct critical to public health surveillance and intervention planning. Understanding the interplay between occupational factors, lifestyle behaviors, and metabolic health is essential for developing targeted health promotion strategies. This study utilizes simulated epidemiological data to investigate the determinants of QoL in a cohort of 200 individuals.

To ensure reproducibility and statistical rigor, data for 200 observations (n=200) were simulated using R (version 4.5.1). The simulation design incorporates bounded continuous variables, categorical risk factors, and specific interaction effects to mimic realistic population health data:

  • Bounded Covariates via Beta Distributions: Unlike normal distributions (which are unbounded) or uniform distributions (which assume equal probability), we utilized Beta(2,2) distributions scaled to domain-specific ranges for Years Working (0–40) and Physical Activity (0–10). This choice reflects a realistic population structure where extreme values (e.g., 0 or 40 years of work) are less frequent than central values.

  • The “True” Structural Model: The outcome \(Y\) (QoL) was generated using a linear specification with a known interaction term: \[QOL = \beta_0 + \beta_1 X_{work} + \beta_2 X_{act} + \beta_3 X_{obese} + \beta_4 (X_{act} \cdot X_{obese}) + \epsilon\] where \(\epsilon \sim N(0, 8)\). The parameters were chosen to reflect clinical plausibility: a high baseline QoL (\(\beta_0=72\)), a strong negative main effect of obesity (\(\beta_3=-18\)), and a negative interaction (\(\beta_4=-1.8\)), implying that obesity attenuates the positive slope of physical activity.

  • Diagnostic Stress-Testing To rigorously test the sensitivity of our subsequent model diagnostics, we introduced a “stress-test” by manually injecting six influential observations. These points possess high leverage (extreme covariate combinations) and/or large residuals, specifically designed to challenge the detection capabilities of Cook’s Distance and DFBETAS.

# ============================================================
# Generate QoL dataset (n=200) with:
# - ranges: qol 0-100, years_working 0-40, phys_activity 0-10
# - obesity as factor with 2 levels
# - interaction: phys_activity * obesity
# - influential observations injected
# ============================================================

rm(list = ls())
set.seed(42)

n <- 200

# -----------------------------
# Covariates
# -----------------------------
years_working  <- rbeta(n, 2, 2) * 40          # 0-40
phys_activity  <- rbeta(n, 2, 2) * 10          # 0-10
obesity        <- sample(c("obese", "not obese"),
                        size = n, replace = TRUE,
                        prob = c(0.35, 0.65))

obese_bin <- ifelse(obesity == "obese", 1, 0)

# -----------------------------
# Outcome with interaction
# -----------------------------
beta0    <- 72
beta_yr  <- -0.6
beta_pa  <-  4.2
beta_ob  <- -18.0
beta_int <- -1.8
sigma    <-  8

err <- rnorm(n, mean = 0, sd = sigma)

qol <- beta0 +
  beta_yr  * years_working +
  beta_pa  * phys_activity +
  beta_ob  * obese_bin +
  beta_int * phys_activity * obese_bin +
  err

qol <- pmin(pmax(qol, 0), 100)

data <- data.frame(
  qol = round(qol, 1),
  years_working = round(years_working, 1),
  phys_activity = round(phys_activity, 1),
  obesity = obesity,
  stringsAsFactors = FALSE
)

# -----------------------------
# Inject influential observations (high leverage + big residuals)
# -----------------------------
influentials <- data.frame(
  qol = c(95, 15, 90, 10, 5, 98),
  years_working = c(40, 0, 39.5, 38.8, 1.0, 0.5),
  phys_activity = c(0, 10, 9.8, 0.2, 9.9, 0.1),
  obesity = c("obese", "not obese", "obese", "not obese", "obese", "not obese"),
  stringsAsFactors = FALSE
)

data[1:nrow(influentials), ] <- influentials

# Shuffle rows (so influentials not clustered at top)
set.seed(123)
data <- data[sample(seq_len(n)), ]
row.names(data) <- NULL

# Save as CSV
write.csv(data, "qol_data.csv", row.names = FALSE)

1.1 Glimpse the generated dataset

Code
library(dplyr)
glimpse(data)
Rows: 200
Columns: 4
$ qol           <dbl> 90.9, 48.6, 48.7, 93.9, 71.2, 86.5, 80.4, 77.2, 76.8, 91…
$ years_working <dbl> 29.0, 16.1, 27.0, 17.9, 24.1, 21.0, 15.2, 11.2, 19.1, 12…
$ phys_activity <dbl> 6.4, 5.1, 8.2, 7.2, 4.2, 4.5, 2.9, 3.9, 3.4, 8.7, 8.6, 4…
$ obesity       <chr> "not obese", "obese", "obese", "not obese", "not obese",…

1.2 Setting up the Environment

1.2.1 Load Required Libraries

library(tidyverse)
library(dplyr)
library(broom)
library(modelr)
library(ggplot2)
library(GGally)
library(patchwork)
library(knitr)
library(broom.helpers)

1.2.2 Transform data

# Change obesity to factors and set reference level
data <- data |> 
  mutate(obesity = factor(obesity,levels = c("not obese", "obese")))
glimpse(data)
Rows: 200
Columns: 4
$ qol           <dbl> 90.9, 48.6, 48.7, 93.9, 71.2, 86.5, 80.4, 77.2, 76.8, 91…
$ years_working <dbl> 29.0, 16.1, 27.0, 17.9, 24.1, 21.0, 15.2, 11.2, 19.1, 12…
$ phys_activity <dbl> 6.4, 5.1, 8.2, 7.2, 4.2, 4.5, 2.9, 3.9, 3.4, 8.7, 8.6, 4…
$ obesity       <fct> not obese, obese, obese, not obese, not obese, not obese…

1.2.3 Label variables

library(labelled)
# Add labels to variables
var_label(data$qol) <- "Quality of Life Score"
var_label(data$years_working) <- "Years Working"
var_label(data$phys_activity) <- "Physical Activity Score"
var_label(data$obesity) <- "Obesity Status"

2 Part B : Exploratory Data Analysis

2.1 Summary Statistics

Code
library(summarytools)
print(dfSummary(data,  
        style        = "grid",  
        varnumbers   = FALSE,
        valid.col    = FALSE
        ),
  method = "render")

Data Frame Summary

data

Dimensions: 200 x 4
Duplicates: 0
Variable Label Stats / Values Freqs (% of Valid) Graph Missing
qol [numeric] Quality of Life Score
Mean (sd) : 70.6 (18.1)
min ≤ med ≤ max:
5 ≤ 72.8 ≤ 100
IQR (CV) : 24.9 (0.3)
168 distinct values 0 (0.0%)
years_working [numeric] Years Working
Mean (sd) : 19.4 (9.2)
min ≤ med ≤ max:
0 ≤ 19.4 ≤ 40
IQR (CV) : 13.6 (0.5)
155 distinct values 0 (0.0%)
phys_activity [numeric] Physical Activity Score
Mean (sd) : 4.9 (2.3)
min ≤ med ≤ max:
0 ≤ 4.8 ≤ 10
IQR (CV) : 3.6 (0.5)
83 distinct values 0 (0.0%)
obesity [factor] Obesity Status
1. not obese
2. obese
131 ( 65.5% )
69 ( 34.5% )
0 (0.0%)

Generated by summarytools 1.1.4 (R version 4.5.1)
2026-01-04

Comment: The dataset consists of 200 observations with complete data (0.0% missingness) across all variables. The outcome variable, qol, exhibits a negative skew with a mean of 70.6 (SD=18.1) and spans the full theoretical range of 0 to 100. Continuous covariates display means of 19.4 years (SD=9.2) for years_working and 4.9 (SD=2.3) for phys_activity, with distributions approximating uniformity and normality respectively. The obesity factor indicates a prevalence of 34.5% (n=69).

2.2 Visualizations

2.2.1 Histograms for Continuous Variables

Code
# Select continuous variables
continuous_vars <- data |> 
  select(where(is.numeric))

# Create histograms for all continuous variables
continuous_vars |> 
  pivot_longer(everything(), names_to = "variable", values_to = "value") |> 
  ggplot(aes(x = value)) +
  geom_histogram(fill = "steelblue", color = "white", bins = 30) +
  facet_wrap(~variable, scales = "free") +
  theme_minimal(base_size = 16) +
  labs(title = "Distribution of Continuous Variables",
       x = "Value",
       y = "Frequency") +
  theme(
    plot.title = element_text(size = 19, face = "bold"),
    axis.title = element_text(size = 16),
    strip.text = element_text(size = 16, face = "bold")
  )
Figure 1: Distribution of Continuous Variables

Comment:The phys_activity scores exhibit an approximately symmetric distribution centered near the scale midpoint. The outcome variable, qol, demonstrates a pronounced negative skew, characterized by a high concentration of observations at the upper bound. The years_working variable displays a distribution spanning the complete 0 to 40-year range without distinct skewness.

2.2.2 Box Plots of Continuous Variables by Categorical Variables

Code
# Create box plots for continuous variables by obesity status
continuous_vars |> 
  bind_cols(data |> select(obesity)) |> 
  pivot_longer(-obesity, names_to = "variable", values_to = "value") |> 
  ggplot(aes(x = obesity, y = value, fill = obesity)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~variable, scales = "free_y") +
  theme_minimal(base_size = 16) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Box Plots of Continuous Variables by Obesity Status",
       x = "Obesity Status",
       y = "Value") +
  theme(
    plot.title = element_text(size = 19, face = "bold"),
    axis.title = element_text(size = 16),
    strip.text = element_text(size = 16, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )
Figure 2: Box Plots of Continuous Variables by Obesity Status

Comment:The Quality of Life score (qol) reveals a distinct separation, with the non-obese group exhibiting a substantially higher median and elevated interquartile range compared to the obese group. Conversely, Years Working (years_working) and Physical Activity score (phys_activity) demonstrate considerable overlap between groups, indicating similar distributions regardless of metabolic status. Outliers are noted in the lower extremes of the Quality of Life score distribution for both groups, identifying specific observations that require monitoring during residual analysis

2.2.3 Scatter Plots of Continuous Variables

Code
p_qol_years <- ggplot(data, aes(x = years_working, y = qol)) +
  geom_point(color = "#2E86AB", alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE,
              color = "#2E86AB", fill = scales::alpha("#2E86AB", 0.2)) +
  theme_minimal() +
  labs(title = "Quality of Life vs Years Working",
       x = "Years Working",
       y = "Quality of Life")

p_qol_phys <- ggplot(data, aes(x = phys_activity, y = qol)) +
  geom_point(color = "#A23B72", alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE,
              color = "#A23B72", fill = scales::alpha("#A23B72", 0.2)) +
  theme_minimal() +
  labs(title = "Quality of Life vs Physical Activity",
       x = "Physical Activity",
       y = "Quality of Life")

p_years_phys <- ggplot(data, aes(x = phys_activity, y = years_working)) +
  geom_point(color = "#F18F01", alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE,
              color = "#F18F01", fill = scales::alpha("#F18F01", 0.2)) +
  theme_minimal() +
  labs(title = "Years Working vs Physical Activity",
       x = "Physical Activity",
       y = "Years Working")

# Combine scatter plots with patchwork
p_qol_years + p_qol_phys + p_years_phys
Figure 3: Scatter Plots with Linear Fits

Comment: A negative linear association is observed between Quality of Life and Years Working, where increased work duration correlates with lower quality of life scores. In contrast, Physical Activity demonstrates a positive linear relationship with Quality of Life, indicating that higher activity levels are associated with better outcomes. The relationship between the covariates Years Working and Physical Activity is weak and lacks a distinct trend, suggesting low potential for multicollinearity between these predictors in the multivariate model.

2.2.4 Bar Plots for Categorical Variables

Code
ggplot(data, aes(x = obesity, fill = obesity)) +
  geom_bar() +
  theme_minimal() +
  labs(title = "Distribution of Obesity Status",
       x = "Obesity Status",
       y = "Count") +
  scale_fill_brewer(palette = "Set2")
Figure 4: Distribution of Categorical Variables

Comment:The not obese group represents the majority of observations (65.5%, n=131), while the obese group comprises the remaining 34.5% (n=69). Despite this imbalance, the minority group maintains a sufficient sample size to support robust parameter estimation and interaction testing.

2.3 Correlation Matrix using corrplot

Code
library(corrplot)
# Compute correlation matrix
cor_matrix <- cor(continuous_vars, use = "complete.obs")
# Plot correlation matrix
corrplot(cor_matrix, method = "color", type = "upper",
         tl.col = "black", tl.srt = 45,
         addCoef.col = "black", number.cex = 0.7,
         title = "Correlation Matrix of Continuous Variables",
         mar = c(0,0,1,0))
Figure 5: Correlation Matrix of Continuous Variables

Comment: Quality of Life exhibits a weak positive linear association with Physical Activity (r=0.24) and a weak negative association with Years Working (r=−0.21). The correlation between the covariates Years Working and Physical Activity is low (r=−0.04), indicating minimal linear dependence between predictors. This lack of strong association between independent variables supports the validity of including both in the multivariate model without significant risk of multicollinearity.

2.4 Ridgeline Plot: QOL Distribution Across Physical Activity Levels

Code
library(ggridges)

# Create physical activity categories
data_ridges <- data |>
  mutate(
    activity_category = cut(
      phys_activity,
      breaks = c(-Inf, 2, 4, 6, 8, Inf),
      labels = c("Very Low (0-2)", "Low (2-4)", "Moderate (4-6)", 
                 "High (6-8)", "Very High (8-10)"),
      include.lowest = TRUE
    )
  )

# Create ridgeline plot
ggplot(data_ridges, aes(x = qol, y = activity_category, fill = obesity)) +
  geom_density_ridges(
    alpha = 0.7,
    scale = 1.2,
    rel_min_height = 0.01,
    quantile_lines = TRUE,
    quantiles = 2
  ) +
  scale_fill_manual(
    values = c("not obese" = "#1a9850", "obese" = "#d73027"),
    name = "Obesity Status"
  ) +
  facet_wrap(~ obesity, ncol = 2) +
  theme_minimal(base_size = 13) +
  labs(
    title = "Quality of Life Distribution Across Physical Activity Levels",
    subtitle = "Stratified by Obesity Status (Median lines shown)",
    x = "Quality of Life Score",
    y = "Physical Activity Level",
    fill = "Obesity Status"
  ) +
  theme(
    legend.position = "bottom",
    strip.text = element_text(face = "bold", size = 12),
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5, size = 11),
    panel.grid.minor = element_blank()
  ) +
  xlim(0, 100)
Figure 6: Distribution of QOL Across Physical Activity Levels by Obesity Status

Comment: The ridgeline plot reveals striking distributional shifts in Quality of Life across physical activity levels, with distinct patterns emerging between obesity groups that hints at different effect of physical activity on QOL based on obesity status. Among non-obese individuals, QOL distributions systematically shift rightward as physical activity increases, exhibiting a clear dose-response relationship where higher activity is associated with not only elevated central tendency but also reduced variability (narrower, more peaked distributions). In contrast, the obese group demonstrates substantially attenuated distributional shifts across activity levels, with broader, more overlapping distributions that remain concentrated at lower QOL values even at maximal physical activity engagement. Most critically, the visualization reveals that the gap between obesity groups widens as physical activity increases: obese individuals cannot “catch up” to their non-obese counterparts despite equivalent activity levels.

3 Part C: Regression Analysis

3.1 Univariable linear regression models

3.1.1 Years working

Code
yr_work <- lm(qol ~ years_working, data = data)
tidy(yr_work, conf.int = TRUE) |>
  mutate(p.value = format.pval(p.value, digits = 2, eps = 0.001))
glance(yr_work)
Table 1: Univariate Regression: Quality of Life by Years Working
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 78.4741719 2.9266133 26.813987 <0.001 72.7028393 84.2455044
years_working -0.4046148 0.1364131 -2.966098 0.0034 -0.6736239 -0.1356057
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.0425427 0.0377071 17.72125 8.79774 0.0033871 1 -857.7356 1721.471 1731.366 62180.49 198 200

Interpretation: Years working demonstrates a statistically significant negative association with QOL (β = -0.40, 95% CI: -0.67, -0.14). For each additional year of work experience, quality of life decreases by approximately 0.40 points on average. However, years working alone explains only 4.3% of the variance in QOL (R² = 0.043), indicating that while statistically significant, occupational tenure is not a strong univariate predictor. This modest R² suggests that other factors potentially including lifestyle behaviors and health status are important determinants of quality of life.

3.1.2 Physical Activity

Code
phys_act <- lm(qol ~ phys_activity, data = data)
tidy(phys_act, conf.int = TRUE) |>
  mutate(p.value = format.pval(p.value, digits = 2, eps = 0.001))
glance(phys_act) 
Table 2: Univariate Regression: Quality of Life by Physical Activity
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 61.419390 2.9421208 20.875890 <0.001 55.6174768 67.221304
phys_activity 1.887124 0.5463226 3.454231 <0.001 0.8097665 2.964482
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.0568361 0.0520727 17.58848 11.93171 0.000675 1 -856.2315 1718.463 1728.358 61252.23 198 200

Interpretation: Physical activity exhibits a significant positive association with QOL (β = 1.89, 95% CI: 0.81, 2.96). Each one-unit increase in physical activity score is associated with approximately a 1.89 point increase in quality of life. Physical activity explains 5.7% of the variance in QOL (R² = 0.057), higher than the explanatory power of years working. This suggests that modifiable lifestyle behaviors may be more influential determinants of well-being than occupational factors in isolation.

3.1.3 Obesity

Code
obesity_uv <- lm(qol ~ obesity, data = data)
tidy(obesity_uv, conf.int = TRUE) |>
  mutate(p.value = format.pval(p.value, digits = 2, eps = 0.001))
glance(obesity_uv)
Table 3: Univariate Regression: Quality of Life by Obesity Status
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 79.26565 1.188259 66.70738 <0.001 76.92238 81.60892
obesityobese -25.03232 2.023027 -12.37370 <0.001 -29.02176 -21.04287
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.4360715 0.4332234 13.60025 153.1083 0 1 -804.8003 1615.601 1625.495 36623.41 198 200

Interpretation: Obesity status demonstrates the strongest univariate association with QOL among all predictors. Obese individuals report, on average, 25.0 points lower quality of life compared to non-obese individuals (β = -25.03, 95% CI: -29.02, -21.04). Obesity alone explains 43.6% of the variance in QOL (R² = 0.436), substantially exceeding the explanatory power of either years working or physical activity in isolation. This robust effect size and high R² underscore obesity as a major health determinant and potential target for intervention, though the univariate estimate may be confounded by unmeasured factors including physical activity levels and other health behaviors.

3.1.4 Summary of univariable analysis

Code
library(gtsummary)
library(gt)
var_labels <- list(
  qol = "Quality of Life",
  years_working = "Years Working",
  phys_activity = "Physical Activity Score",
  obesity = "Obesity Status"
)

# Extract R-squared from each model
r2_years <- glance(yr_work)$r.squared
r2_phys <- glance(phys_act)$r.squared
r2_obesity <- glance(obesity_uv)$r.squared

# Create R-squared table
r2_table <- tibble(
  variable = c("years_working", "phys_activity", "obesity"),
  r.squared = c(r2_years, r2_phys, r2_obesity)
)

tbl_uv <- data |>
  select(qol, years_working, phys_activity, obesity) |>
  tbl_uvregression(
    method = lm,
    y = qol,
    exponentiate = FALSE,
    conf.level = 0.95,
    pvalue_fun = ~style_pvalue(.x, digits = 3, eps = 0.001),
    label = var_labels,
    estimate_fun = ~style_number(.x, digits = 2)
  ) |>
  modify_header(label = "**Variables**") |>
  modify_table_body(
    ~.x |>
    left_join(r2_table, by = "variable") |>
    relocate(r.squared, .after = p.value) |>
    mutate(r.squared = style_number(r.squared, digits = 3))
  ) |>
  modify_header(r.squared = "**R²**") |>
  as_gt() |>
  tab_options(table.width = pct(100))
tbl_uv
Table 4: Summary Table of Univariable Regression Models
Variables N Beta 95% CI p-value
Years Working 200 -0.40 -0.67, -0.14 0.003 0.043
Physical Activity Score 200 1.89 0.81, 2.96 <0.001 0.057
Obesity Status 200


0.436
    not obese

0.436
    obese
-25.03 -29.02, -21.04 <0.001 0.436
Abbreviation: CI = Confidence Interval

Synthesis: The univariate analyses reveal that all three predictors are significantly associated with quality of life, but with varying effect sizes and explanatory power. Obesity emerges as the strongest single predictor (R² = 0.436), followed by physical activity (R² = 0.057) and years working (R² = 0.043).

3.2 Multivariable linear regression models

3.2.1 Model 1: Main effects only

Code
model1 <- lm(qol ~ years_working + phys_activity + obesity, data = data)
tidy(model1, conf.int = TRUE) |>
  mutate(p.value = format.pval(p.value, digits = 2, eps = 0.001))
Table 5: Multivariable Regression Model 1: Main Effects Only
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 76.3070101 2.9096068 26.225884 <0.001 70.5688546 82.0451655
years_working -0.3295846 0.0967254 -3.407426 <0.001 -0.5203406 -0.1388285
phys_activity 1.9035607 0.3899422 4.881648 <0.001 1.1345395 2.6725819
obesityobese -24.8633411 1.8679630 -13.310403 <0.001 -28.5472279 -21.1794542

Interpretation: The main effects model (Model 1) reveals that after adjusting for all predictors, years working (β = -0.33) and physical activity (β = 1.90) retain statistical significance with effect sizes almost similar to univariate models. However, the obesity coefficient is changed to -24.86 (compared to -25.03 univariately), suggesting partial confounding by physical activity and work duration. All predictors demonstrate independent associations with QOL, supporting their inclusion in the model.

3.2.2 Model 2: Interaction between years physical activity and obesity

Code
model2 <- lm(qol ~ years_working + phys_activity + obesity + phys_activity:obesity, data = data)
tidy(model2, conf.int = TRUE) |>    
  mutate(p.value = format.pval(p.value, digits = 2, eps = 0.001))
Table 6: Multivariable Regression Model 2: With Interaction Term
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 72.1488446 3.3274270 21.683074 <0.001 65.5864796 78.7112095
years_working -0.3285842 0.0954853 -3.441201 <0.001 -0.5169008 -0.1402677
phys_activity 2.7553012 0.5163079 5.336546 <0.001 1.7370366 3.7735658
obesityobese -15.4872990 4.2127503 -3.676292 <0.001 -23.7957021 -7.1788958
phys_activity:obesityobese -1.9162788 0.7741386 -2.475369 0.014 -3.4430381 -0.3895196

Interpretation: Model 2 introduces an interaction term between physical activity and obesity, yielding a statistically significant interaction coefficient (β = -1.92, p = 0.014). This negative interaction indicates that the beneficial effect of physical activity on QOL is diminished among obese individuals. Specifically, while non-obese individuals gain 2.76 QOL points per unit increase in physical activity, obese individuals gain only 0.84 points (2.76 - 1.92 = 0.84). The significance of this interaction term suggests that obesity status is an important effect modifier, not merely a confounder, and that separate interpretation of effects by obesity stratum is warranted.

3.2.3 Compare Model 1 and Model 2

Formal model comparison using F-test and information criteria helps determine whether the additional complexity introduced by the interaction term is justified by improved model fit. We evaluate both statistical significance and practical model performance metrics.

3.2.3.1 Using anova function (Partial F test)

Code
anova(model1, model2)
Table 7: Partial F-test Comparing Model 1 and Model 2
Res.Df RSS Df Sum of Sq F Pr(>F)
196 30840.76 NA NA NA NA
195 29901.18 1 939.5801 6.127453 0.0141627

Interpretation: The test (Kamarul Imran, Wan Nor Arifin, and Tengku Muhammad Hanis Tengku Mokhtar, n.d.) yields a statistically significant result (p = 0.014), providing strong evidence that Model 2 (with interaction) fits the data significantly better than Model 1 (main effects only). The reduction in residual sum of squares indicates that accounting for the differential effect of physical activity by obesity status explains additional variance in quality of life beyond what is captured by additive main effects alone.

3.2.3.2 Using performance package

Code
library(performance)
compare_performance(model1, model2, rank = TRUE)
Table 8: Model Performance Comparison Between Model 1 and Model 2
Name Model R2 R2_adjusted RMSE Sigma AIC_wt AICc_wt BIC_wt Performance_Score
model2 lm 0.5395805 0.5301360 12.22726 12.38303 0.8903113 0.8840082 0.609394 1
model1 lm 0.5251129 0.5178442 12.41788 12.54396 0.1096887 0.1159918 0.390606 0

Interpretation: The performance comparison confirms the superiority of Model 2 across multiple criteria. Model 2 demonstrates higher R² (0.539 vs 0.525) and adjusted R² indicating better explanatory power. The lower AIC and BIC values for Model 2 further support its selection, as these information criteria explicitly penalize model complexity while rewarding fit. The RMSE is also marginally lower in Model 2, indicating more accurate predictions. These metrics provide supportive evidence that the interaction model offers superior balance between parsimony and fit, justifying its selection as the preliminary final model (Lüdecke et al. 2021).

3.2.4 Preliminary final model

Code
prelim_tbl <- tbl_regression(
    model2,
    intercept = TRUE,
    conf.level = 0.95,
    pvalue_fun = ~style_pvalue(.x, digits = 3, eps = 0.001),
    label = var_labels,
    estimate_fun = ~style_number(.x, digits = 2)
) |>
    add_glance_table(
        include = c(r.squared, adj.r.squared, AIC, nobs)
    )
prelim_tbl
Table 9: Preliminary Final Model
Characteristic Beta 95% CI p-value
(Intercept) 72.15 65.59, 78.71 <0.001
Years Working -0.33 -0.52, -0.14 <0.001
Physical Activity Score 2.76 1.74, 3.77 <0.001
Obesity Status


    not obese
    obese -15.49 -23.80, -7.18 <0.001
Physical Activity Score * Obesity Status


    Physical Activity Score * obese -1.92 -3.44, -0.39 0.014
0.540

Adjusted R² 0.530

AIC 1,581

No. Obs. 200

Abbreviation: CI = Confidence Interval

4 Part D: Model Checking

4.1 Check assumptions

4.1.1 Generate predicted values and residuals

data.fitted <- data <- augment(model2)

4.1.2 Assumption 1: Linearity

4.1.2.1 Overall linearity

Code
# Residuals vs Fitted plot
plot(model2, which = 1)
Figure 7: Residuals vs Fitted Values Plot

Interpretation: The Residuals vs. Fitted plot demonstrates a generally linear relationship, with the red smoothing line remaining close to the zero residual line across the majority of the fitted value range. However, a slight upward trend in the smoothing line is observed as fitted values exceed 90, suggesting minor systematic deviations at the upper extreme of predicted Quality of Life scores. Furthermore, several extreme outliers are identified: observation 159 exhibits a substantial negative residual approaching -100, while observations 122 and 136 also deviate significantly from the central cluster. These specific data points require further investigation as they may disproportionately influence the model’s parameter estimates and overall fit (Fox and Weisberg 2019) .

4.1.2.2 Linearity for each numerical predictor

Code
years.resid <- ggplot(data.fitted, aes(x = years_working, y = .resid)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", color = "blue") +
  theme_minimal() +
  labs(title = "Residuals vs Years Working",
       x = "Years Working",
       y = "Residuals")
phys.resid <- ggplot(data.fitted, aes(x = phys_activity, y = .resid)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", color = "blue") +
  theme_minimal() +
  labs(title = "Residuals vs Physical Activity",
       x = "Physical Activity",
       y = "Residuals")

years.resid + phys.resid
Figure 8: Residuals vs Individual Predictors

Interpretation: The LOESS smoothing lines remain relatively horizontal and centered near zero throughout the central mass of the data, suggesting that the linear functional form is an appropriate specification for the majority of the sample. However, notable deviations occur at the distribution boundaries: a downward curvature is evident at the lower extreme of Years Working (near 0 years) and the upper extreme of Physical Activity score (near 10), while an upward trend appears at the highest levels of work experience. These patterns are primarily driven by specific influential cases, such as observation 159, which possesses an extreme negative residual that pulls the smoothing line downward at the tails.

4.1.3 Assumption 2: Normality of Residuals

Code
par(mfrow = c(1, 2))
plot(model2, which = 2)
hist.resid <- hist(data.fitted$.resid, breaks = 30,
                   main = "Histogram of Residuals",
                   xlab = "Residuals",
                   col = "lightblue",
                   border = "black")               
Figure 9: Q-Q Plot and Histogram of Residuals

Interpretation: The Q-Q plot reveals that while the majority of residuals follow the theoretical diagonal, there are significant departures at both extremes, most notably in the lower tail where observation 159 exhibits an extreme negative deviation. This is further supported by the histogram, which displays a distribution characterized by a heavy negative skew and isolated residuals extending beyond -80, along with a slight positive skew at the upper bound (e.g., observation 136). These heavy tails indicate a violation of the normality assumption, suggesting that the presence of these extreme outliers may bias the standard errors and affect the reliability of the model’s hypothesis tests (Harrell , 2015).

4.1.4 Assumption 3: Homoscedasticity

4.1.4.1 Visual diagnostics

Code
par(mfrow = c(1, 2))
plot(model2, which = c(1,3))
Figure 10: Residuals and Scale-Location Plots

Interpretation: The Scale-Location plot provides a standardized assessment of the homoscedasticity assumption by plotting the square root of absolute standardized residuals against fitted values. The red smoothing line remains relatively flat across the lower and middle ranges of predicted Quality of Life scores, indicating that residual variance is consistent for most participants. A slight upward trend is observed at the highest fitted values, primarily driven by influential observations such as 159 and 136, which exhibit larger variances than the central data cluster. While these specific outliers increase local variability, the lack of a strong, systematic “fan” or “funnel” shape suggests that the assumption of constant variance (homoscedasticity) is sufficiently met for valid parametric inference (Harrell , 2015) .

4.1.4.2 Breusch-Pagan Test

Code
library(lmtest)
bptest(model2)

    studentized Breusch-Pagan test

data:  model2
BP = 4.2319, df = 4, p-value = 0.3755

Comment: Since the p-value exceeds the conventional alpha level of 0.05, we fail to reject the null hypothesis of constant variance (Breusch and Pagan 1979). This result indicates that, despite the slight patterns noted in the diagnostic plots, there is no statistically significant evidence of heteroscedasticity in the model residuals.

4.1.5 Assumption 4: Independence of Residuals

Independence is primarily ensured by the study design: each participant contributes a single, unrelated observation, and there is no clustering, repeated measures, or temporal ordering that would induce serial correlation. We then complement this design-based justification with residuals vs observation order and the Durbin-Watson test as a formal check for autocorrelation (Durbin and Watson 1950).

4.1.5.1 Visual diagnostics

Code
plot(residuals(model2), type = "o")
abline(h = 0, col = "red", lty = 2)
Figure 11: Residuals Over Observation Order

Interpretation: No obvious patterns or trends in the plot, suggesting that the residuals are independent (Department of Statistics Penn. State University 2018).

4.1.5.2 Durbin-Watson Test

Code
library(lmtest)
dwtest(model2)

    Durbin-Watson test

data:  model2
DW = 2.1229, p-value = 0.8104
alternative hypothesis: true autocorrelation is greater than 0

Interpretation: P-value > 0.05, indicates that we fail to reject the null hypothesis of no autocorrelation, suggesting that the residuals are independent (Durbin and Watson 1950).

4.1.6 Assumption 5: No multicollinearity

4.1.6.1 Using VIF for model without interaction

Code
library(car)
vif_values <- vif(model2, type = "predictor")
kable(vif_values)
Table 10: Variance Inflation Factor (VIF) for Multicollinearity Assessment
GVIF Df GVIF^(1/(2*Df)) Interacts With Other Predictors
years_working 1.00345 1 1.001724 phys_activity, obesity
phys_activity 1.00345 3 1.000574 obesity years_working
obesity 1.00345 3 1.000574 phys_activity years_working

Interpretation:Multicollinearity was assessed using generalized variance inflation factors (GVIFs) from the full regression model, including the interaction between physical activity and obesity. Because the interaction term introduces multiple model-matrix columns associated with each predictor, predictor-level GVIFs were computed and interpreted using the adjusted measure GVIF (Fox and Weisberg 2019).

4.2 Checking influentials

4.2.1 Cook’s distance

We use Cook’s distance to identify influential observations that may disproportionately affect the model’s estimates. The threshold for Cook’s distance is commonly set at 4/n, where n is the number of observations in the dataset.

Code
cutoff <- 4/(nrow(data))
plot(model2, which = 4, cook.levels = cutoff)
Figure 12: Cook’s Distance for Influential Observations

Interpretation: We identify observations with Cook’s distance greater than the cutoff value as influential points that may warrant further investigation.Let’s look at the 3 data points with highest Cook’s distance.

Code
data[c(122, 136, 159), ] 
Table 11: Details of Top Three Influential Observations Based on Cook’s Distance
qol years_working phys_activity obesity .fitted .resid .hat .sigma .cooksd .std.resid
5 1 9.9 obese 64.63928 -59.63928 0.0882294 11.57676 0.4923603 -5.043857
95 40 0.0 obese 43.51818 51.48182 0.0894181 11.79519 0.3727960 4.356799
15 0 10.0 not obese 99.70186 -84.70186 0.0732850 10.68756 0.7985183 -7.105472

These data must be checked for data entry errors or other issues. If no issues are found, we may consider refitting the model without these points to assess their impact.

4.2.2 Residuals vs Leverage plot

Code
plot(model2, which = 5)
Figure 13: Residuals vs Leverage Plot

Interpretation: The Residuals vs. Leverage plot identifies observation 159 as a highly influential outlier, falling near the critical Cook’s distance D=1 contour due to its combination of high leverage and an extreme negative standardized residual. Additionally, observations 136 and 122 approach or exceed the D=0.5 threshold, suggesting they also possess unusual covariate profiles that disproportionately affect the model estimates. The red smoothing line exhibits an upward curvature at high leverage values, confirming that these specific data points are distorting the regression coefficients and likely biasing the model fit.

4.2.3 Using DFBETAS

We use threshold of 2/sqrt(n) to identify influential observations for each predictor.

Code
# Compute DFBETAS and display as a datatable
library(DT)
dfb <- dfbetas(model2)
dfb_long <- dfb |> 
  as.data.frame() |> 
  mutate(id = row_number()) |> 
  pivot_longer(
    cols = -id,
    names_to = "term",
    values_to = "dfbeta"
  )

# Cutoff
dfb_threshold <- 2 / sqrt(nrow(data))

# Identify influential observations
influential_dfb <- dfb_long |> 
  filter(abs(dfbeta) > dfb_threshold) |>
  arrange(desc(abs(dfbeta)))
influential_dfb %>%
  datatable()
Table 12: DFBETAS for Each Observation and Predictor (Descending Order)

Interpretation: Analysis of DFBETAS statistics highlights specific observations exerting disproportionate influence on parameter estimates. Observation 159 demonstrates the most substantial impact, shifting the regression coefficient for phys_activity by approximately -1.79 standard errors and the years_working coefficient by +1.20 standard errors. Furthermore, this observation biases the interaction term (phys_activity:obesityobese) by +1.22 standard errors, suggesting it contradicts the general relationship between activity and obesity status established by the rest of the data. Observations 122 and 136 also display significant influence on the interaction estimates, with DFBETAS values of -0.98 and -0.82 respectively, indicating that these points distinctly alter the modeled interaction effect.

4.2.4 Using standardized residuals and leverage values cutoff

We will identify data points with standardized residuals greater than 2 or less than -2, as well as those with leverage values exceeding the calculated cutoff. Leverage cutoff is calculated as 2*(p+2)/n, where p is the number of predictors and n is the number of observations.

Code
leverage_cutoff <- 2 * (length(coef(model2)) + 2) / nrow(data)
data.fitted %>% 
  mutate(id = row_number()) %>%
  filter(.std.resid > 2 | .std.resid < -2 | .hat > leverage_cutoff) %>%
  datatable()
Table 13: Outliers Identified by Standardized Residuals and Leverage

We have identified 6 potential outliers based on standardized residuals and leverage cutoff. It also included the 3 influential points identified earlier using Cook’s distance and DFBETAS (observations 159, 122, 136).

Code
# Plot standardized residuals vs leverage
ggplot(data.fitted, aes(x = .hat, y = .std.resid)) +
  geom_point(alpha = 0.8, size = 2.5, color = "#2E86AB") +
  geom_hline(yintercept = c(-2, 2), linetype = "dashed", color = "red") +
  geom_vline(xintercept = leverage_cutoff, linetype = "dashed", color = "red") +
  labs(
    title = "Standardized Residuals vs Leverage",
    x = "Leverage (.hat)",
    y = "Standardized Residuals (.std.resid)",
    color = "Observation Status"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "right",
    legend.title = element_text(face = "bold"),
    plot.title = element_text(face = "bold", hjust = 0.5)
  )
Figure 14: Standardized Residuals vs Leverage Plot

Interpretation: The plot of standardized residuals against leverage values highlights several observations that exceed the established thresholds, indicating potential outliers. Specifically, few data points fall outside the acceptable range defined by standardized residuals greater than ±2 or leverage values exceeding the calculated cutoff.

4.3 Remedial measures

4.3.1 Remove outliers and refit model

data.nooutliers <- data.fitted %>%
  filter(.std.resid < 2 & .std.resid > -2 & .hat < leverage_cutoff)

model_removed <- lm(qol ~ years_working + phys_activity + obesity + phys_activity:obesity, data = data.nooutliers)
tidy(model_removed, conf.int = TRUE) |> mutate(p.value = format.pval(p.value, digits = 3, eps = 0.001))
glance(model_removed)
Table 14: Refitted Model After Removing Outliers
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 73.9881650 2.1237673 34.838169 <0.001 69.7988321 78.1774980
years_working -0.5425106 0.0619837 -8.752472 <0.001 -0.6647793 -0.4202418
phys_activity 3.3940677 0.3332031 10.186184 <0.001 2.7367930 4.0513424
obesityobese -18.2019000 2.7097173 -6.717269 <0.001 -23.5470749 -12.8567250
phys_activity:obesityobese -1.6008753 0.5057194 -3.165541 0.0018 -2.5984548 -0.6032958
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.7966093 0.7923047 7.486639 185.0615 0 4 -663.2866 1338.573 1358.18 10593.4 189 194

4.3.2 Compare model fitness between preliminary final model with model after removing outliers

  • Preliminary model
Code
model_performance(model2)
Table 15: Performance Metrics for Preliminary Model
AIC AICc BIC R2 R2_adjusted RMSE Sigma
1581.043 1581.478 1600.833 0.5395805 0.530136 12.22726 12.38303
  • Refitted model after outlier removal
Code
model_performance(model_removed)
Table 16: Performance Metrics for Refitted Model
AIC AICc BIC R2 R2_adjusted RMSE Sigma
1338.573 1339.022 1358.18 0.7966093 0.7923047 7.389532 7.486639
  • Using compare_performance function
Code
compare_performance(model2, model_removed, rank = TRUE)
Table 17: Side-by-Side Performance Comparison of Models
Name Model R2 R2_adjusted RMSE Sigma AIC_wt AICc_wt BIC_wt Performance_Score
model_removed lm 0.7966093 0.7923047 7.389532 7.486639 1 1 1 1
model2 lm 0.5395805 0.5301360 12.227261 12.383029 0 0 0 0

Interpretation: After removing the identified outliers, we observe a significant improvement in model performance metrics such as R-squared and AIC. The refitted model demonstrates better explanatory power and a more parsimonious fit, indicating that the removal of outliers has positively impacted the model’s accuracy and reliability.

4.3.3 Compare coefficient changes after removing outliers

Code
coef_comparison <- tidy(model2) %>%
  select(term, estimate) %>%
  rename(estimate_prelim = estimate) %>%
  left_join(
    tidy(model_removed) %>%
      select(term, estimate) %>%
      rename(estimate_refit = estimate),
    by = "term"
  ) %>%
  mutate(
    change_in_estimate = estimate_refit - estimate_prelim,
    percent_change = (change_in_estimate / estimate_prelim) * 100
  )
coef_comparison
Table 18: Changes in Coefficients After Removing Outliers
term estimate_prelim estimate_refit change_in_estimate percent_change
(Intercept) 72.1488446 73.9881650 1.8393205 2.549342
years_working -0.3285842 -0.5425106 -0.2139263 65.105484
phys_activity 2.7553012 3.3940677 0.6387665 23.183181
obesityobese -15.4872990 -18.2019000 -2.7146010 17.527918
phys_activity:obesityobese -1.9162788 -1.6008753 0.3154036 -16.459169

Interpretation: The removal of influential outliers revealed notable changes in parameter estimates across all coefficients. The negative association between years_working and quality of life increased by 65% (from -0.33 to -0.54), while the beneficial effect of physical activity strengthened by 23% (from 2.76 to 3.39). Most notably, the obesity main effect increased by 18% (from -15.5 to -18.2), indicating obese individuals face an even larger quality of life deficit than initially estimated. Conversely, the interaction term weakened by 16% (from -1.92 to -1.60), suggesting that the preliminary model slightly overestimated the extent to which obesity dampens the benefits of physical activity. The intercept showed minimal change (+2.6%). While the directionality of all effects remained stable, these magnitude changes support the decision to rely on the refitted estimates for final inference, as they provide a more robust representation of the underlying data structure free from the leverage of outlier observations.

4.3.4 Compare model diagnostics after removing outliers

Code
par(mfrow = c(4, 2))

# Row 1: Residuals vs Fitted
plot(model2, which = 1, main = "Original Model")
plot(model_removed, which = 1, main = "Without Outliers")

# Row 2: Q-Q Plot
plot(model2, which = 2, main = "Original Model")
plot(model_removed, which = 2, main = "Without Outliers")

# Row 3: Scale-Location
plot(model2, which = 3, main = "Original Model")
plot(model_removed, which = 3, main = "Without Outliers")

# Row 4: Cook's Distance
plot(model2, which = 4, main = "Original Model")
plot(model_removed, which = 4, main = "Without Outliers")
Figure 15: Comparison of Diagnostic Plots Before and After Outlier Removal

Interpretation:

  • Linearity (Residuals vs Fitted): The original model displayed a slight U-shaped curvature in the smoothing line. In the adjusted model, the smoothing line is horizontal and centered near zero, suggesting that the linear specification now adequately captures the relationship between the predictors and the outcome variable.

  • Normality (Q-Q Residuals): The initial Q-Q plot showed significant deviation from the diagonal in the lower tail, driven by extreme negative residuals (e.g., Obs 17, 125). Post-exclusion, the standardized residuals align closely with the theoretical quantile line, indicating that the distribution of error terms is now approximately normal.

  • Homoscedasticity (Scale-Location): The downward trend in the original Scale-Location plot suggested heteroscedasticity, with residual variance decreasing as fitted values increased. The model without outliers exhibits more consistent spread of residuals across the range of fitted values, consistent with the assumption of constant variance.

  • Influential Observations (Cook’s Distance): Initial diagnostics identified observations 159, 122, and 136 as having high influence, with Cook’s distance values approaching or exceeding 0.8. In the revised model, the maximum Cook’s distance is reduced to below 0.05, confirming that the regression coefficients are no longer disproportionately determined by individual data points.

5 Part E: Result presentation and Interpretation

5.1 Final model

Code
library(gt)
final_tbl <- tbl_regression(
  model_removed,
  intercept = TRUE,
  conf.level = 0.95,
  pvalue_fun = ~style_pvalue(.x, digits = 3, eps = 0.001),
  label = var_labels,
  estimate_fun = ~style_number(.x, digits = 2)
) |>
  add_glance_table(
    include = c(r.squared, adj.r.squared, AIC, statistic),
    label = list(statistic = "F-statistic")
  ) |>
  bold_p(t = 0.05) |>
  bold_labels() |>
  modify_header(
    label = "**Variables**",
    estimate = "**Adj. Beta**",
    std.error = "**SE**",
    statistic = "**t-stat**"
  ) |>
  modify_column_unhide(c(std.error, statistic)) |>
  modify_table_body(
    ~ .x |> dplyr::relocate(std.error, .after = estimate) |>
           dplyr::relocate(statistic, .before = p.value)
  ) |> 
  as_gt() %>%
  opt_align_table_header(align = "left") %>%
  tab_options(
    table.border.top.style = "none",
    table.border.bottom.style = "none",
    heading.border.bottom.style = "solid",
    heading.border.bottom.color = "black",
    table_body.border.top.color = "black",
    table_body.border.bottom.style = "solid",
    table_body.border.bottom.color = "black",
    table_body.hlines.style = "none",
    column_labels.border.top.style = "solid",
    column_labels.border.top.width = px(2),
    column_labels.border.top.color = "black",
    column_labels.border.bottom.style = "solid",
    column_labels.border.bottom.width = px(2),
    column_labels.border.bottom.color = "black",
    data_row.padding = px(5),
    table.width = pct(100)
    ) %>%
    tab_style(
    style = cell_text(align = "right"),
    locations = cells_body(columns = -label)
    ) %>%
    opt_align_table_header(align = "left")
  final_tbl
Table 19: Final Multivariable Linear Regression Model for Quality of Life Score
Variables Adj. Beta SE 95% CI t-stat p-value
(Intercept) 73.99 2.12 69.80, 78.18 34.8 <0.001
Years Working -0.54 0.062 -0.66, -0.42 -8.75 <0.001
Physical Activity Score 3.39 0.333 2.74, 4.05 10.2 <0.001
Obesity Status




    not obese
    obese -18.20 2.71 -23.55, -12.86 -6.72 <0.001
Physical Activity Score * Obesity Status




    Physical Activity Score * obese -1.60 0.506 -2.60, -0.60 -3.17 0.002
0.797



Adjusted R² 0.792



AIC 1,339



F-statistic 185



Abbreviations: CI = Confidence Interval, SE = Standard Error

5.2 Final model equation

\[ \text{QOL} = 74.0 - 0.54\times\text{Years working} + 3.39\times\text{Physical Activity} - 18.20\times\text{Obese} - 1.60\,(\text{Physical Activity} \times \text{Obese}) + \varepsilon \]

5.3 Interpretation

1. Model Performance & Overall Fit

  • Goodness of Fit: The model explains a substantial proportion of the variance in Quality of Life scores, with an R² of 0.797 (Adjusted R² = 0.797). This indicates that approximately 79.7% of the variability in QOL scores is accounted for by years working, physical activity, obesity status, and their interactions.

  • Global Significance: The F-statistic of 185 is large, implying the overall model is highly statistically significant and provides a better fit than an intercept-only null model.

2. Interpretation of Coefficients (Main Effects & Interactions)

  • Intercept (Baseline Reference): The predicted QOL score is 74.0 (95% CI: 69.80, 78.18) for an individual who has worked 0 years, has a Physical Activity score of 0, and is not obese.

  • Years Working: Holding physical activity and obesity constant, for every one additional year of working, the QOL score decreases by 0.54 units (95% CI: -0.66, -0.42). Over a 40-year career, this would amount to a substantial cumulative decrease of approximately 21.6 points in QOL.

  • The Interaction Effect (Physical Activity × Obesity): The interaction term is statistically significant, indicating that the relationship between physical activity and QOL depends on obesity status. We cannot interpret the main effects of Physical Activity or Obesity in isolation; they must be interpreted conditionally.

  • Effect of Physical Activity (Stratified by Obesity Status):

    • For Non-Obese Individuals (Reference): Physical activity has a strong positive effect. For every 1-unit increase in physical activity score, QOL increases by 3.39 points (95%CI 2.74, 4.05).
    • For Obese Individuals: The beneficial effect of physical activity is significantly dampened by the interaction term (−1.60). The net effect is 3.39−1.60=1.79. Thus, for obese individuals, every 1-unit increase in physical activity increases QOL by only 1.79 points.
  • Effect of Obesity (Conditional on Physical Activity):

    • The coefficient for “Obesity: obese” (−18.2) specifically represents the difference in QOL between obese and non-obese individuals when the Physical Activity Score is 0.

    • At the lowest level of physical activity, obese individuals have a QOL score 18.2 points lower than non-obese individuals (95%CI -23.55, -12.86).

  • As physical activity increases, this gap widens further (due to the negative interaction term), meaning the disparity in QOL between obese and non-obese individuals grows larger at higher activity levels.

3. Practical Implications

  • The “Protective” Effect of Activity: Physical activity is a strong predictor of higher QOL for everyone, but the return on investment is significantly higher for non-obese individuals.

  • Vulnerability of the Obese Group: Obese individuals start with a lower baseline QOL (intercept adjustment) and gain QOL at a slower rate from physical activity (slope adjustment) compared to their non-obese counterparts.

  • Work-Life Balance: The consistent negative impact of years working suggests potential cumulative occupational fatigue or burnout affecting QOL, independent of health behaviors.

5.4 Interaction plot

Code
library(interactions)

# Create interaction plot using interact_plot
interact_plot(
  model_removed,
  pred = phys_activity,
  modx = obesity,
  colors = c("#2E86AB", "#A23B72"),
  x.label = "Physical Activity Score",
  y.label = "Predicted Quality of Life Score",
  legend.main = "Obesity Status",
  main.title = "Interaction Between Physical Activity and Obesity Status",
  modx.labels = c("Not Obese", "Obese")
) 
Figure 16: Interaction Between Physical Activity and Obesity Status on Quality of Life

Interpretation: The interaction plot demonstrates that while increased physical activity is positively associated with higher predicted Quality of Life for both groups, the magnitude of this benefit varies significantly by obesity status. The “Not Obese” group exhibits a steeper slope compared to the “Obese” group, indicating that non-obese individuals derive a greater marginal improvement in QoL for every unit increase in physical activity score. Consequently, the disparity in QoL between the two groups widens as physical activity increases; at low activity levels, the gap is moderate, but at high activity levels, the non-obese group significantly outperforms the obese group. Practically, this suggests that while physical activity promotion is a valid intervention for all, obesity may dampen the potential QoL gains from exercise, implying that public health interventions might need to pair physical activity with targeted weight management strategies to maximize quality of life outcomes for the obese population.

6 Prediction

6.1 Create new data for prediction

Code
new_data <- expand.grid(
  years_working = seq(0, 40, by = 10),
  phys_activity = seq(0, 10, by = 2),
  obesity = c("not obese", "obese")
)

new_data |> datatable()

Prediction Grid: Combinations of Predictor Values

Interpretation: The prediction grid systematically spans the full covariate range observed in the dataset, creating comprehensive combinations across work tenure levels, physical activity scores, and both obesity groups to examine model predictions across all meaningful clinical and occupational scenarios.

6.2 Generate predictions with confidence intervals using final model

Code
predicted <- augment(model_removed, newdata = new_data, interval = "confidence", level = 0.95)
predicted |> datatable()

Predicted Quality of Life Scores with 95% Confidence Intervals

Interpretation: The predicted QOL scores reveal substantial variability driven by the combined influence of occupational tenure, physical activity, and obesity status. The most vulnerable profile (obese individuals with maximum work tenure and minimal activity) shows severely compromised well-being, while the most favorable profile (non-obese individuals with minimal work history and maximum activity) approaches optimal QOL.

6.3 Prediction visualization

Code
library(ggplot2)
ggplot(predicted, aes(x = phys_activity, y = .fitted, color = obesity, fill = obesity)) +
  # Add confidence intervals (Ribbons)
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha = 0.2, color = NA) +
  # Add regression lines
  geom_line(linewidth = 1) +
  # Facet by years_working to show the relationship across career lengths
  facet_grid(~ years_working, labeller = label_both) + 
  scale_color_brewer(palette = "Set1") + 
  scale_fill_brewer(palette = "Set1") +
  theme_minimal(base_size = 12) +
  labs(
    title = "Predicted Quality of Life by Physical Activity and Obesity Status",
    subtitle = "Stratified by Years Working (Model Predictions with 95% CIs)",
    y = "Predicted QOL Score (0-100)",
    x = "Physical Activity Score (0-10)",
    color = "Group",
    fill = "Group"
  ) +
  theme(
    legend.position = "bottom",
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold")
  )
Figure 17: Predicted Quality of Life with Confidence Intervals

Interpretation: There is a robust positive linear relationship between physical activity and QOL, where increased physical activity consistently correlates with higher predicted QOL scores across all strata. Stratifying by years_working reveals a notable negative main effect of occupational tenure; as years working increases from 0 to 40, the baseline QOL systematically decreases for all individuals, suggesting a cumulative burden of long-term employment on well-being. Regarding the interaction between physical activity and obesity, the “not obese” group maintains a significantly higher QOL compared to the “obese” group across the entire range of activity, evidenced by the non-overlapping 95% confidence intervals. While the slopes for both groups are positive and relatively parallel, indicating that the beneficial rate of return on physical activity is similar regardless of weight status. The persistent intercept gap emphasizes that obesity acts as a significant independent suppressor of QOL, highlighting the public health necessity of promoting physical activity as a buffering mechanism, particularly for individuals with high occupational duration.

6.4 Prediction heatmap tool

Code
library(ggplot2)
library(broom)

# 1. Create prediction grid
# specific representative values to create the "blocks"
heatmap_data <- expand.grid(
  years_working = c(0, 10, 20, 30, 40),      # Y-axis snapshots
  phys_activity = c(0, 2.5, 5, 7.5, 10),     # X-axis snapshots
  obesity = c("not obese", "obese")          # Facet variable
)

# 2. Generate predictions
heatmap_pred <- augment(model_removed, newdata = heatmap_data)

# 3. Create the heatmap
ggplot(heatmap_pred, aes(x = as.factor(phys_activity), y = as.factor(years_working), fill = .fitted)) + 
  geom_tile(color = "white", lwd = 1.5) +
  geom_text(aes(label = round(.fitted, 1)), color = "white", fontface = "bold", size = 5) +
  facet_wrap(~ obesity) +
  scale_fill_gradient(low = "#d73027", high = "#1a9850", name = "Predicted\nQOL Score") +
  labs(
    title = "Predicted Quality of Life Score Heatmap",
    subtitle = "By Physical Activity and Years Working (Stratified by Obesity)",
    x = "Physical Activity Score",
    y = "Years Working"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    panel.grid = element_blank(),        
    strip.text = element_text(face = "bold", size = 12),
    axis.text = element_text(color = "gray30"),
    legend.position = "right"
  )
Figure 18: Heatmap of Predicted Quality of Life Scores

Interpretation:This heatmap functions as a clinical risk prediction matrix to provide actionable risk assessments for patient counseling. By mapping the predicted QOL scores across the interaction of physical activity and obesity , stratified by years working, the tool isolates specific “high-risk” groups that require immediate intervention. The visualization identifies the most vulnerable subgroup - obese individuals with 40 years of workforce tenure and sedentary lifestyles (activity score 0) who present with a critical baseline QOL of 34.1. This matrix demonstrates the quantifiable benefit of lifestyle modification: a patient in the “non ibese/20-years-working” stratum can effectively “migrate” from a compromised QOL of 63.1 to a robust score of 97.1 by maximizing physical activity. This transforms the statistical interaction into a practical roadmap for targeted public health strategy, suggesting that resources should be prioritized for individuals in the lower-left quadrants (high tenure, low activity) where the “return on investment” for physical activity interventions is most critical.

7 Conclusion

This multivariable linear regression analysis demonstrates that Quality of Life is significantly influenced by years working, physical activity, obesity status, and the interaction between physical activity and obesity. The final model (R² = 0.797) reveals that longer occupational tenure is associated with decreased QOL, while physical activity confers protective benefits, though these benefits are substantially attenuated in obese individuals. After removing influential outliers, the model met all regression assumptions and demonstrated robust predictive validity. The significant interaction effect indicates that obesity not only independently reduces QOL but also dampens the positive impact of physical activity, with non-obese individuals gaining 1.60 additional QOL points per unit of activity compared to their obese counterparts. These findings underscore the importance of integrated interventions that combine physical activity promotion with weight management strategies, particularly for individuals with extended occupational tenure who face compounded QOL vulnerabilities.

8 Github Repository

GitHub Repository

The complete code and data for this analysis are available at:

https://github.com/drfittri/regressionanalysis_gdt500

9 References

Breusch, T. S., and A. R. Pagan. 1979. “A Simple Test for Heteroscedasticity and Random Coefficient Variation.” Econometrica 47 (5): 1287–94. https://doi.org/10.2307/1911963.
Department of Statistics Penn. State University. 2018. “Residuals Vs. Order Plot.” https://online.stat.psu.edu/stat462/node/121/.
Durbin, J., and G. S. Watson. 1950. “Testing for Serial Correlation in Least Squares Regression: i.” Biometrika 37 (3/4): 409. https://doi.org/10.2307/2332391.
Fox, John, and Sanford Weisberg. 2019. An R Companion to Applied Regression. Third edition. Los Angeles London New Delhi Singapore Washington, DC Melbourne: SAGE.
Harrell , Frank E. 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer Series in Statistics. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-19425-7.
Kamarul Imran, Wan Nor Arifin, and Tengku Muhammad Hanis Tengku Mokhtar. n.d. Data Analysis in Medicine and Health Using r. https://bookdown.org/drki_musa/dataanalysis/binary-logistic-regression.html#models-comparison.
Lüdecke, Daniel, Mattan S. Ben-Shachar, Indrajeet Patil, Philip Waggoner, and Dominique Makowski. 2021. Performance: An r Package for Assessment, Comparison and Testing of Statistical Models” 6: 3139. https://doi.org/10.21105/joss.03139.